feat: Add judge evaluation support to agent graphs #142
Merged
jsonbailey merged 10 commits into main on Apr 28, 2026
Conversation
Adds per-node judge evaluation to agent graph execution. Each AIAgentConfig now carries a pre-built Evaluator (mirroring AICompletionConfig) that the provider-specific AgentGraphRunner invokes after each node's model response. Results are tracked via the same AIConfigTracker used for that node's LLM metrics, ensuring evaluation data is correlated correctly.

Key changes:

- New Evaluator class coordinating multiple judges; evaluate() returns an asyncio Task so evaluation fires immediately and is awaited in flush()
- AIAgentConfig and AICompletionConfig carry an eager evaluator (kw_only field)
- LangGraph runner stores per-node eval tasks in _pending_eval_tasks and flushes them via the callback handler's async flush() method
- OpenAI runner fires judge evaluation at handoff and final-segment points
- client._build_evaluator() handles empty/None judge config via Evaluator.noop()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
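For orientation, here is a minimal sketch of the eager-evaluation pattern this commit describes. Everything below is illustrative rather than the SDK's actual code: the `Judge` stub, the `JudgeResult` fields, and the `demo` driver are assumptions; only the idea that `evaluate()` returns an already-running `asyncio.Task` comes from the PR.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class JudgeResult:          # assumed shape; the real class lives in the SDK
    judge_name: str
    score: float


class Judge:                # stand-in for an LLM-backed judge
    def __init__(self, name: str) -> None:
        self.name = name

    async def judge(self, output: str) -> JudgeResult:
        await asyncio.sleep(0)               # real judges call a model here
        return JudgeResult(self.name, 1.0 if output else 0.0)


class Evaluator:
    """Coordinates multiple judges over one model response."""

    def __init__(self, judges: List[Judge]) -> None:
        self._judges = judges

    def evaluate(self, output: str) -> asyncio.Task[List[JudgeResult]]:
        # Returning a *Task* rather than a coroutine means evaluation is
        # already running; flush() only has to await it later.
        return asyncio.ensure_future(self._run_all(output))

    async def _run_all(self, output: str) -> List[JudgeResult]:
        return list(await asyncio.gather(*(j.judge(output) for j in self._judges)))


async def demo() -> None:
    task = Evaluator([Judge("relevance")]).evaluate("model output")
    print(await task)        # fires immediately, awaited at flush time

asyncio.run(demo())
```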
keelerm84 approved these changes on Apr 24, 2026
- Remove quotes from asyncio.Task return type in Evaluator.evaluate()
- Update ModelResponse.evaluations type to asyncio.Task[List[JudgeResult]]
- Forward default_ai_provider to __evaluate_agent in create_agent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_pending_eval_tasks was keyed by node key, so repeated visits (e.g. cycles or tool loops) would silently overwrite earlier eval tasks. Changed to Dict[str, List[Task]] with setdefault/append so all invocations are tracked. flush() now iterates the full list per node.

Also wraps the long __evaluate_agent call in create_agent to satisfy E501.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
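A compact sketch of the keying fix described above, under assumed names (`record_eval_task` and `flush_node` are illustrative helpers, not SDK functions):

```python
import asyncio
from typing import Dict, List

# node key -> every eval task fired for that node, not just the last one
_pending_eval_tasks: Dict[str, List["asyncio.Task"]] = {}


def record_eval_task(node_key: str, task: "asyncio.Task") -> None:
    # setdefault/append keeps earlier tasks when a node runs again
    # (cycles, tool loops); plain assignment would silently drop them.
    _pending_eval_tasks.setdefault(node_key, []).append(task)


async def flush_node(node_key: str) -> list:
    # flush() iterates the full list per node.
    results: list = []
    for t in _pending_eval_tasks.pop(node_key, []):
        results.extend(await t)  # each task resolves to a list of judge results
    return results
```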
Replace asyncio.create_task fire-and-forget with proper task collection and awaiting in both OpenAI and LangGraph runners, ensuring judge results are tracked reliably.

Use ContextVar in LangGraph runner to isolate pending eval task state across concurrent run() calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
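The `ContextVar` isolation can be pictured like this (a sketch under assumed names; `run_graph` and its node wiring are hypothetical, not the runner's real structure):

```python
import asyncio
from contextvars import ContextVar
from typing import Dict, List, Optional

# Each asyncio Task runs in its own copy of the context, so concurrent
# run() calls on the same runner never see each other's pending tasks.
_pending: ContextVar[Optional[Dict[str, List["asyncio.Task"]]]] = ContextVar(
    "pending_eval_tasks", default=None
)


async def run_graph(name: str) -> None:
    _pending.set({})                      # fresh, run-scoped state
    store = _pending.get()
    assert store is not None
    # A node firing one eval task; real nodes would call the evaluator.
    store.setdefault("node_a", []).append(
        asyncio.ensure_future(asyncio.sleep(0, result=f"{name}: eval"))
    )
    for tasks in store.values():          # flush: await everything we fired
        print(await asyncio.gather(*tasks))


async def main() -> None:
    # Two concurrent runs; each sees only its own _pending dict.
    await asyncio.gather(run_graph("run-1"), run_graph("run-2"))

asyncio.run(main())
```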
…_ai_provider

- Remove if-tracker guards in both runners since create_tracker is always set on enabled graphs (disabled graphs are filtered before runner creation), also fixing a token_usage NameError when tracker=None
- Forward variables through _build_evaluator to _initialize_judges so judge templates can interpolate user-provided variables
- Add default_ai_provider param to agent_graph() and forward it to __evaluate_agent so graph node evaluators use the correct provider; propagate from create_agent_graph() as well

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Redesign ManagedModel._track_judge_results to call evaluator.evaluate() internally and attach tracking via add_done_callback, returning the task so the reference is held by ModelResponse.evaluations (no GC risk)
- Warn instead of silently dropping eval tasks when the LangGraph ContextVar is unexpectedly unset in a node's execution context
- Make AgentGraphDefinition.create_tracker a required parameter; all production and test call sites already supply it, and this matches the invariant that runners only execute on enabled (always-tracked) graphs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
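The done-callback redesign roughly follows this shape (illustrative only; the real `_track_judge_results` signature is not shown in the PR):

```python
import asyncio


def _track_judge_results(evaluator, tracker, output: str) -> asyncio.Task:
    # Start evaluation here rather than at the call site.
    task = evaluator.evaluate(output)

    def _on_done(t: asyncio.Task) -> None:
        if t.cancelled() or t.exception() is not None:
            return                          # nothing to track on failure
        for result in t.result():
            tracker.track_judge_result(result)

    task.add_done_callback(_on_done)
    # Returning the task lets ModelResponse.evaluations hold a strong
    # reference, avoiding the classic pitfall where a Task referenced only
    # by its done-callback can be garbage-collected mid-flight.
    return task
```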
Both branches independently added evaluator/judge logic (this branch) and root-level tools map support (main). Conflicts in _completion_config and __evaluate_agent were resolved by keeping both changes. The parameter order swap for track_metrics_of_async auto-resolved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…iew items

- Fix AgentGraphResult.evaluations type from Optional[List[Any]] to Optional[List[JudgeResult]]
- Populate evaluations in both LangGraph and OpenAI runners with all judge results
- Remove stray `if tracker:` guard in OpenAI _handle_handoff (tracker is always set)
- Add comment documenting why output_text is empty at handoff time in the OpenAI runner
- flush() now returns List[JudgeResult] instead of None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
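The type fix in the first bullet can be sketched like this (an illustrative subset; the real `AgentGraphResult` and `JudgeResult` fields are assumptions beyond what the commit names):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JudgeResult:                 # assumed shape
    judge_name: str
    score: float


@dataclass
class AgentGraphResult:            # illustrative subset of the real class
    output: str
    # Typed as Optional[List[JudgeResult]] rather than Optional[List[Any]],
    # so downstream consumers get real attribute access and type checking.
    evaluations: Optional[List[JudgeResult]] = None
```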
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 2a15009.
- Add `from __future__ import annotations` to evaluator.py so the self-referential `-> Evaluator` return type does not need quoting
- Log a warning when a judge fails to initialize in _initialize_judges instead of silently swallowing the exception

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
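Both bullets are easy to picture in a few lines (a sketch; the `_initialize_judges` internals and the `build_judge` stub are assumptions, not the SDK's code):

```python
from __future__ import annotations   # annotations become strings, so the
                                     # self-referential type needs no quotes
import logging

logger = logging.getLogger(__name__)


class Evaluator:
    @classmethod
    def noop(cls) -> Evaluator:      # previously had to be written -> "Evaluator"
        return cls()


def build_judge(cfg: dict):
    # Hypothetical stand-in for real judge construction.
    if "name" not in cfg:
        raise ValueError("judge config missing 'name'")
    return cfg["name"]


def _initialize_judges(configs: list) -> list:
    judges = []
    for cfg in configs:
        try:
            judges.append(build_judge(cfg))
        except Exception:
            # Warn rather than silently swallow a failed judge init.
            logger.warning("Failed to initialize judge %r", cfg, exc_info=True)
    return judges
```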
The OpenAI Agents SDK does not expose a node's text output at handoff time, making it impossible to evaluate intermediate nodes against real output. Rather than evaluating against an empty string, remove evaluation support from the OpenAI runner entirely until the SDK provides a suitable API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Summary
- New `Evaluator` class that coordinates per-node judge evaluation; `evaluate()` returns an `asyncio.Task` so evaluation fires immediately and is awaited before the graph run returns
- `AIAgentConfig` (and `AICompletionConfig`) now carry a pre-built `Evaluator` as a `kw_only` dataclass field, constructed eagerly in `client._build_evaluator()`
- The LangGraph `AgentGraphRunner` stores per-node eval tasks in `_pending_eval_tasks` during node execution; `LangChainCallbackHandler.flush()` (now async) awaits them and calls `track_judge_result` via the same `AIConfigTracker` used for that node's LLM metrics
- The OpenAI `AgentGraphRunner` fires judge evaluation at handoff and final-segment points, tracked via the node's config tracker
- `Evaluator.noop()` provides a null-object default so nodes without a `judgeConfiguration` require no special handling (a minimal sketch follows this list)
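A minimal sketch of the null-object default (the judge names and `demo` wiring are assumptions; only `Evaluator.noop()` and its purpose come from the summary):

```python
from __future__ import annotations

import asyncio
from typing import List, Optional


class Evaluator:
    def __init__(self, judges: Optional[List[str]] = None) -> None:
        self._judges = judges or []

    @classmethod
    def noop(cls) -> Evaluator:
        # Null object: same interface, zero judges.
        return cls([])

    def evaluate(self, output: str) -> asyncio.Task:
        async def _run() -> List[str]:
            return [f"{j} scored {output!r}" for j in self._judges]
        return asyncio.ensure_future(_run())


async def demo() -> None:
    # Node code never branches on whether a judgeConfiguration exists:
    # a noop evaluator resolves immediately to an empty list.
    for ev in (Evaluator(["helpfulness"]), Evaluator.noop()):
        print(await ev.evaluate("hi"))

asyncio.run(demo())
```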
Test plan

- `make test` (248 tests across 3 packages)
- `make lint`
- Run `langgraph-multi-agent-example` or `chat-judge-example` via `hello-python-ai` pointing at this worktree and verify judge events appear in the LD events stream

Closes AIC-2267
🤖 Generated with Claude Code
Note
Medium Risk
Introduces asynchronous judge-evaluation execution and wires results into both `ManagedModel` and the agent-graph runners, changing result types and tracker-flushing behavior. Risk is moderate due to new concurrency/task handling and API surface changes around `create_tracker` and the `evaluations` fields.

Overview

Adds a new `Evaluator` abstraction and threads it through `AICompletionConfig`/`AIAgentConfig` so judge evaluations can be kicked off per invocation and tracked automatically.

Updates `ManagedModel.invoke()` to start evaluation via the config's `evaluator`, attach a completion callback to emit `track_judge_result`, and changes `ModelResponse.evaluations` to carry an `asyncio.Task` (while `AgentGraphResult` now includes collected judge results).

Extends LangGraph execution to schedule per-node evaluation tasks during node invocation, store them per-run using a `ContextVar`, and make `LDMetricsCallbackHandler.flush()` async so it can await tasks, track successful judge results per node, and return all results.

Refactors judge initialization in `LDAIClient` to build evaluators eagerly (including new `default_ai_provider` plumbing), removes async judge setup from `create_model()`, and tightens `AgentGraphDefinition.create_tracker` to be required; OpenAI agent-graph tracking is aligned to always use a graph tracker and now returns token usage in `LDAIMetrics`.

Reviewed by Cursor Bugbot for commit e2f5b93.